17.4 Extrinsic Methods
257
database sequences and the consequences for deductive inference do not yet seem
to have been given the attention they deserve. In particular, there appears to be a
feeling associated with the “big data” movement that, provided one has enough data,
the errors will somehow be “averaged out” or “autocompensated”, although proper
justification for this notion is lacking.
17.4.2
Sequence Comparison and Alignment
The pairwise comparison of sequences is very widely used in bioinformatics. Evi-
dently, it is a subset of the general problem of pattern recognition (Sect. 13.1). If it
were only a question of finding matches to more or less lengthy blocks of symbol
sequences (e.g., the longest common subsequence; LCS), the task would be relatively
straightforward and the main work would be merely to assess the statistical signifi-
cance of the result; that is, compare with the null hypothesis that a match occurred
by chance (cf. Sect. 9.2.1). In reality, however, the two sequences one is trying to
compare differ due to mutations, insertions, and deletions (cf. Sect. 14.7.1), which
renders the problem considerably more complicated; one has to allow for gaps, and
one tries to make inferences from local alignments between subsequences. A typical
example of an attempt to align fragments of two nucleotide sequences is
upper I left parenthesis s Subscript a Baseline comma s Subscript b Baseline right parenthesis equals upper I left parenthesis s Subscript b Baseline comma s Subscript a Baseline right parenthesis equals upper I left parenthesis s Subscript a Baseline right parenthesis minus upper I left parenthesis s Subscript a Baseline vertical bar s Subscript b Baseline right parenthesis equals upper I left parenthesis s Subscript b Baseline right parenthesis minus upper I left parenthesis s Subscript b Baseline vertical bar s Subscript a Baseline right parenthesis period
A
C
G
T
A
C
G
T
A
−
G
T
|
|
|
|
|
|
|
|
A
C
−
−
A
T
G
T
A
C
G
T
where vertical lines indicate matches. Note the gaps that have been inserted to increase
the number of matches. In the absence of gaps, one could simply compute the Ham-
ming distance between two sequences; the introduction of the possibility of gaps
introduces two problems: (i) the number of possible alignments becomes very large
and (ii) where are gaps to be placed in sequence space?
If no gaps are allowed, one assigns and sums scores for all possible pairs of aligned
substrings within the two sequences to be matched. If gaps are allowed, there areStartBinomialOrMatrix 2 n Choose n EndBinomialOrMatrix
(2n
n
)
possible alignments of two sequences each of length nn. 16 Even for moderate values
ofnn, there are too many possibilities to be enumerated (problem (i), a computational
one). It is solved using dynamic programming algorithms (Sect. 17.4.4). Problem
(ii) is solved by devising a scoring system with which gaps and substitutions can be
assigned numerical values. Finally, one needs to assess the statistical significance of
the alignment. This is still an unsolved problem—let us call it problem (iii).
The essence of sequence alignment is to assign a score, or cost, for each possible
alignment; the one with the lowest cost, or highest score, is the best one, and if
16 This is obtained by considering the number of ways of intercalating two sequences while pre-
serving the order of symbols in each.